1 Preparations
Load the necessary libraries
library(glmmTMB) #for model fitting
library(car) #for regression diagnostics
library(broom) #for tidy output
library(ggfortify) #for model diagnostics
library(DHARMa) #for residual diagnostics
library(see) #for plotting residuals
library(knitr) #for kable
library(effects) #for partial effects plots
library(ggeffects) #for partial effects plots
library(emmeans) #for estimating marginal means
library(modelr) #for auxiliary modelling functions
library(tidyverse) #for data wrangling
library(lindia) #for diagnostics of lm and glm
library(performance) #for residuals diagnostics
library(sjPlot) #for outputs
library(report) #for reporting methods/results
library(easystats) #framework for stats, modelling and visualisation
library(MASS) #for negative binomials
library(MuMIn) #for AICc
2 Scenario
Here is a modified example from Peake and Quinn (1993). Peake and Quinn (1993) investigated the relationship between the number of individuals of invertebrates living in amongst clumps of mussels on a rocky intertidal shore and the area of those mussel clumps.
| AREA | INDIV |
|---|---|
| 516.00 | 18 |
| 469.06 | 60 |
| 462.25 | 57 |
| 938.60 | 100 |
| 1357.15 | 48 |
| ... | ... |
| AREA | Area of mussel clump (mm²) - Predictor variable |
| INDIV | Number of individuals found within clump - Response variable |
The aim of the analysis is to investigate the relationship between mussel clump area and the number of non-mussel invertebrate individuals supported in the mussel clump.
3 Read in the data
Rows: 25 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): AREA, INDIV
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
peake (25 rows and 2 variables, 2 shown)
ID | Name | Type | Missings | Values | N
---+-------+---------+----------+-----------------+---
1 | AREA | numeric | 0 (0.0%) | [462.25, 27144] | 25
---+-------+---------+----------+-----------------+---
2 | INDIV | numeric | 0 (0.0%) | [18, 1402] | 25
------------------------------------------------------
4 Exploratory data analysis
Model formula: \[ y_i \sim{} \mathcal{Pois}(\lambda_i)\\ ln(\lambda_i) = \beta_0 + \beta_1 ln(x_i) \]
where the number of individuals in the \(i^{th}\) observation is assumed to be drawn from a Poisson distribution with a mean (\(\lambda\)) of \(\lambda_i\). The natural log of these expected values is modelled against a linear predictor that includes an intercept (\(\beta_0\)) and a slope (\(\beta_1\)) for natural log transformed area.
Conclusions:
- there is some evidence of non-homogeneity of variance (observations are more tightly clustered around the trendline for small values of AREA and the spread increases throughout the trend).
- there is some evidence of non-linearity as evidenced by the substantial change in trajectory of the loess smoother.
- nevertheless, there is also evidence of non-normality in both the y-axis and x-axis (there are more points at low values of both y and x). Deviations from normality can make it difficult to assess other assumptions, since non-normality is often the underlying cause of apparent non-homogeneity and non-linearity.
Conclusions:
- indeed both the response (INDIV) and predictor (AREA) appear to be skewed. It is not surprising that the response is skewed as it represents counts of the number of individuals. We might expect that this should follow a Poisson distribution. Poisson distributions must be bounded by 0 at the lower end and hence, as the expected value (=mean and location) of a Poisson distribution approaches 0, the distribution will become more and more asymmetrical. At the same time, the distribution would be expected to become narrower (have a smaller variance) - since in a Poisson distribution the mean and variance are equal.
Along with modelling these data against a Poisson distribution (with a natural log link), we should probably attempt to normalise the predictor variable (AREA). Whilst linear models don’t assume that the predictor variable follows a normal distribution (it is typically assumed to be uniform), it is assumed to be symmetrical. In this case, a natural log transformation might help achieve this. At the same time, it might also help with homogeneity of variance and linearity.
We can mimic the action of using a log link and log-transformed predictor by presenting the scatterplot on log-transformed axes.
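A minimal base-R sketch of such a plot, using the five rows shown in the Scenario table as stand-in data (in the tutorial the full 25-row `peake` data set is read in from file):

```r
## Stand-in data: first five rows from the Scenario table; the tutorial
## uses the full 25-row peake data set read in from file
peake_demo <- data.frame(
  AREA  = c(516.00, 469.06, 462.25, 938.60, 1357.15),
  INDIV = c(18, 60, 57, 100, 48)
)
## log = "xy" draws both axes on a log scale, mimicking a log link
## together with a log-transformed predictor
plot(INDIV ~ AREA, data = peake_demo, log = "xy",
     xlab = "AREA (mm^2, log scale)", ylab = "INDIV (log scale)")
```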
5 Fit the model
\[ y_i \sim{} \mathcal{N}(\mu_i, \sigma^2)\\ \mu_i = \beta_0 + \beta_1 x_i \]
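This Gaussian model corresponds to an ordinary least-squares fit with `lm()`, matching the `lm(INDIV ~ AREA, data = peake)` call shown in the influence measures output below. A minimal sketch (using the five Scenario rows as stand-in data):

```r
## Stand-in data: first five rows from the Scenario table
peake_demo <- data.frame(
  AREA  = c(516.00, 469.06, 462.25, 938.60, 1357.15),
  INDIV = c(18, 60, 57, 100, 48)
)
## Fit the Gaussian linear model
peake.lm <- lm(INDIV ~ AREA, data = peake_demo)
## influence.measures() produces the dfbetas, Cook's D and leverage
## diagnostics of the kind tabulated in the validation output
infl <- influence.measures(peake.lm)
```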
Model validation
Conclusions:
- there is clear evidence of a wedge (funnel) shape in the residuals indicative of non-homogeneity of variance
- the tails of the Q-Q plot deviate substantially from the ideal line suggesting that the observed data are not normally distributed.
- there is an observation with a high Cook’s distance.
Influence measures of
lm(formula = INDIV ~ AREA, data = peake) :
dfb.1_ dfb.AREA dffit cov.r cook.d hat inf
1 -0.1999 0.13380 -0.20006 1.125 2.04e-02 0.0724
2 -0.1472 0.09887 -0.14731 1.150 1.12e-02 0.0728
3 -0.1507 0.10121 -0.15073 1.148 1.17e-02 0.0728
4 -0.1153 0.07466 -0.11549 1.155 6.92e-03 0.0687
5 -0.1881 0.11765 -0.18896 1.117 1.82e-02 0.0653
6 -0.1214 0.07306 -0.12237 1.142 7.75e-03 0.0622
7 -0.0858 0.05206 -0.08639 1.155 3.88e-03 0.0628
8 -0.0176 0.01059 -0.01776 1.165 1.65e-04 0.0621
9 -0.0509 0.02633 -0.05236 1.150 1.43e-03 0.0535
10 -0.0237 0.01069 -0.02504 1.148 3.28e-04 0.0489
11 0.0475 -0.01961 0.05096 1.141 1.35e-03 0.0470
12 -0.0418 0.01717 -0.04492 1.142 1.05e-03 0.0468
13 -0.0061 0.00221 -0.00672 1.144 2.36e-05 0.0448
14 -0.0453 0.01859 -0.04862 1.142 1.23e-03 0.0468
15 0.1641 -0.05100 0.18584 1.067 1.74e-02 0.0433
16 0.0472 -0.00254 0.06317 1.129 2.08e-03 0.0401
17 0.1764 -0.01859 0.22781 1.021 2.57e-02 0.0403
18 0.0542 0.01493 0.09093 1.120 4.28e-03 0.0411
19 0.2173 0.12011 0.43430 0.808 8.29e-02 0.0433
20 -0.0701 -0.02168 -0.12016 1.106 7.43e-03 0.0413
21 0.1849 0.63635 1.07759 0.357 3.36e-01 0.0614 *
22 0.0752 -0.33126 -0.39474 1.157 7.79e-02 0.1352
23 -0.0789 0.22611 0.25071 1.362 3.25e-02 0.2144 *
24 0.0853 -0.21744 -0.23573 1.473 2.88e-02 0.2681 *
25 0.4958 -1.32133 -1.44477 0.865 8.44e-01 0.2445 *
Conclusions:
- there is clear evidence of a wedge (funnel) shape in the residuals indicative of non-homogeneity of variance
- the tails of the Q-Q plot deviate substantially from the ideal line suggesting that the observed data are not normally distributed.
- there is an observation with a high Cook’s distance.
$uniformity
Asymptotic one-sample Kolmogorov-Smirnov test
data: simulationOutput$scaledResiduals
D = 0.18, p-value = 0.3927
alternative hypothesis: two-sided
$dispersion
DHARMa nonparametric dispersion test via sd of residuals fitted vs.
simulated
data: simulationOutput
dispersion = 0.96392, p-value = 0.952
alternative hypothesis: two.sided
$outliers
DHARMa outlier test based on exact binomial test with approximate
expectations
data: simulationOutput
outliers at both margin(s) = 1, observations = 25, p-value = 0.1813
alternative hypothesis: true probability of success is not equal to 0.007968127
95 percent confidence interval:
0.0010122 0.2035169
sample estimates:
frequency of outliers (expected: 0.00796812749003984 )
0.04
Conclusions:
- there is clear evidence of a wedge (funnel) shape in the residuals indicative of non-homogeneity of variance
Please go to the Poisson (GLM) tab.
\[ y_i \sim{} \mathcal{Pois}(\lambda_i)\\ ln(\lambda_i) = \beta_0 + \beta_1 ln(x_i) \]
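This Poisson model is fitted with `glm()` using a log link and a log-transformed predictor, matching the call shown in the influence measures output below. A sketch using the five Scenario rows as stand-in data:

```r
## Stand-in data: first five rows from the Scenario table
peake_demo <- data.frame(
  AREA  = c(516.00, 469.06, 462.25, 938.60, 1357.15),
  INDIV = c(18, 60, 57, 100, 48)
)
## Poisson GLM with a log link and natural-log-transformed predictor
peake.glm <- glm(INDIV ~ log(AREA), family = poisson(link = "log"),
                 data = peake_demo)
coef(peake.glm)
```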
Model validation
Conclusions:
- there is still clear evidence of a wedge (funnel) shape in the residuals indicative of non-homogeneity of variance
- the tails of the Q-Q plot deviate substantially from the ideal line suggesting that the observed data are not normally distributed.
- there is an observation with a high Cook’s distance.
Influence measures of
glm(formula = INDIV ~ log(AREA), family = poisson(link = "log"), data = peake) :
dfb.1_ dfb.l.AR dffit cov.r cook.d hat inf
1 -0.27480 0.26720 -0.2810 1.073 1.8302 0.0710 *
2 -0.04611 0.04488 -0.0471 1.173 0.0736 0.0705
3 -0.05602 0.05453 -0.0572 1.171 0.1071 0.0704
4 -0.04819 0.04647 -0.0502 1.174 0.0845 0.0720
5 -0.31731 0.30365 -0.3375 1.028 2.7532 0.0698 *
6 -0.14873 0.14126 -0.1620 1.133 0.7980 0.0668 *
7 -0.05948 0.05659 -0.0645 1.166 0.1386 0.0675
8 0.07488 -0.07110 0.0816 1.161 0.2477 0.0667
9 -0.06040 0.05587 -0.0727 1.151 0.1760 0.0575
10 -0.03779 0.03419 -0.0500 1.149 0.0848 0.0527
11 0.04660 -0.04160 0.0653 1.143 0.1555 0.0508
12 -0.06833 0.06095 -0.0961 1.133 0.3023 0.0507
13 -0.02679 0.02345 -0.0408 1.146 0.0569 0.0489
14 -0.07303 0.06515 -0.1027 1.131 0.3433 0.0507
15 0.14202 -0.12170 0.2359 1.040 2.1279 0.0477 *
16 0.01114 -0.00816 0.0302 1.145 0.0327 0.0467
17 0.10458 -0.07980 0.2557 1.018 2.4750 0.0465 *
18 0.00878 -0.00353 0.0508 1.145 0.0930 0.0501
19 0.03006 0.01642 0.4473 0.857 7.2424 0.0536 *
20 -0.04114 0.01412 -0.2614 1.028 1.9833 0.0505 *
21 -0.21773 0.31062 0.9321 0.530 25.9624 0.0742 *
22 0.27322 -0.31834 -0.5255 1.087 8.0430 0.1366 *
23 -0.06181 0.06967 0.1003 1.347 0.3584 0.1916 *
24 0.16956 -0.18902 -0.2594 1.382 2.2615 0.2255 *
25 0.84423 -0.94513 -1.3210 0.824 39.5373 0.2109 *
# Overdispersion test
dispersion ratio = 70.410
Pearson's Chi-Squared = 1619.441
p-value = < 0.001
Model has no observed zeros in the response variable.
## Note, the following cannot be piped for the plot to work!
#performance::check_normality(peake.glm) |> plot()
peake.glm |> performance::check_outliers()
11 outliers detected: cases 1, 5, 6, 15, 17, 19, 20, 21, 22, 24, 25.
- Based on the following method and threshold: cook (0.714).
- For variable: (Whole model).
Conclusions:
- there is clear evidence of a wedge (funnel) shape in the residuals indicative of non-homogeneity of variance
- the tails of the Q-Q plot deviate substantially from the ideal line suggesting that the observed data are not normally distributed.
- there are observations with a high Cook’s distance.
$uniformity
Asymptotic one-sample Kolmogorov-Smirnov test
data: simulationOutput$scaledResiduals
D = 0.54099, p-value = 8.826e-07
alternative hypothesis: two-sided
$dispersion
DHARMa nonparametric dispersion test via sd of residuals fitted vs.
simulated
data: simulationOutput
dispersion = 105.43, p-value < 2.2e-16
alternative hypothesis: two.sided
$outliers
DHARMa bootstrapped outlier test
data: simulationOutput
outliers at both margin(s) = 13, observations = 25, p-value < 2.2e-16
alternative hypothesis: two.sided
percent confidence interval:
0.00 0.04
sample estimates:
outlier frequency (expected: 0.0112 )
0.52
Conclusions:
- there is clear evidence of issues with the residual plot (including outliers)
- the Q-Q plot does not closely follow the ideal line
- there is evidence of violations of distributional conformity (KS test), Dispersion and Outliers
The model appears to be over-dispersed. That is, the Poisson distribution would expect (assume) that the variance is equal to the mean. The diagnostics suggest that there is more variability in the response than the Poisson model would expect.
This could be due to:
- the model being too simple. Linear models are intentionally low-dimensional representations of the system. They do not, and cannot, represent all the complexity in the underlying system. Instead they are intended to provide very targeted insights into a small number of potential influences. Nevertheless, the model might still be too simplistic. Perhaps if we were able to add an additional covariate, we might be able to explain the additional variation. For example, although it might be reasonable to expect that the number of individuals in a mussel clump should be driven by a Poisson process, the mussel clump area alone might not be sufficient to capture the variance reasonably. Perhaps if we were able to include additional information, such as the position of the clump along the shore (tidal influence) or orientation on the rock etc., we might be able to explain more of the currently unexplained variation.
- in the absence of additional covariates, it is possible to add a special type of dummy covariate that acts as a proxy for all the additional covariates we could have added. This dummy variable is called a unit-level (or observation-level) random effect. It essentially soaks up the additional variance.
- another source of additional variation arises when the objects being counted have a tendency to aggregate (be clumped). When this is the case, the sampled data tend to be more varied, since any single sample is likely to contain either fewer or more items than the overall average. Hence the average of the samples might turn out to be similar to that of a more homogeneous population, yet it will have a higher variance. The Negative Binomial distribution is able to accommodate clumpy data. Rather than assume that the variance and mean are equal (dispersion of 1), it estimates dispersion as an additional parameter. Together, the two parameters of the Negative Binomial can be used to estimate the location (mean) and spread (variance) of the distribution.
- yet another cause of over-dispersion is the presence of excessive zeros. In particular, some of these zeros could be false zeros. That is, a zero was recorded (no individuals observed), yet one or more individuals were actually present (just not detected). This could imply that the observed data are generated by two processes: one that determines the actual number of items that exist, and another that determines the detectability of the items. For these circumstances, we can fit zero-inflated models. Zero-inflated models are a mixture of two models, one representing the count process and the other representing the imperfect detection process.
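The dispersion ratio reported earlier (Pearson's chi-squared divided by the residual degrees of freedom) can be computed by hand. A sketch using simulated clumped counts (hypothetical data, not the peake data):

```r
set.seed(1)
## Simulate clumped (negative binomial) counts, then fit a Poisson
## GLM to them -- this deliberately induces over-dispersion
x <- runif(100, 1, 10)
y <- rnbinom(100, mu = exp(0.5 + 0.3 * x), size = 2)
fit <- glm(y ~ x, family = poisson)
## Dispersion ratio = Pearson chi-squared / residual df;
## values much greater than 1 indicate over-dispersion
disp <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
```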
Please go to the Negative Binomial (GLM) tab.
\[ y_i \sim{} \mathcal{NegBin}(\lambda_i, \theta)\\ ln(\lambda_i) = \beta_0 + \beta_1 ln(x_i) \]
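The Negative Binomial model is fitted with `MASS::glm.nb()` (in the tutorial, the call shown in the influence measures below is `glm.nb(INDIV ~ log(AREA), data = peake)`). A sketch on simulated clumped counts (hypothetical data):

```r
library(MASS)  # for glm.nb
set.seed(1)
## Simulated clumped counts (hypothetical data, not the peake data)
x <- runif(100, 1, 10)
y <- rnbinom(100, mu = exp(0.5 + 0.3 * x), size = 2)
fit.nb <- glm.nb(y ~ x)
## theta is the estimated dispersion parameter:
## Var(y) = mu + mu^2 / theta
fit.nb$theta
```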
Model validation
Conclusions:
- there is no longer clear evidence of a wedge (funnel) shape in the residuals suggesting that the assumption of homogeneity of variance is satisfied
- the tails of the Q-Q plot are an improvement on the Poisson.
- there are no longer any large Cook’s distances.
Influence measures of
glm.nb(formula = INDIV ~ log(AREA), data = peake, init.theta = 7.369263003, link = log) :
dfb.1_ dfb.l.AR dffit cov.r cook.d hat inf
1 -1.054875 0.987692 -1.1293 0.748 0.313001 0.1554 *
2 0.206875 -0.194302 0.2200 1.279 0.031769 0.1645 *
3 0.162776 -0.152952 0.1729 1.293 0.019220 0.1659 *
4 0.096225 -0.087632 0.1105 1.207 0.007770 0.1033
5 -0.505155 0.446373 -0.6335 0.801 0.113572 0.0777
6 -0.105801 0.090193 -0.1480 1.133 0.010967 0.0629
7 0.016821 -0.014457 0.0230 1.169 0.000318 0.0655
8 0.179367 -0.152721 0.2519 1.071 0.045529 0.0626
9 -0.010516 0.007231 -0.0249 1.142 0.000356 0.0440
10 -0.003083 0.001214 -0.0134 1.139 0.000105 0.0408
11 0.013908 -0.000181 0.0976 1.116 0.006282 0.0406
12 -0.009512 -0.000223 -0.0692 1.128 0.002572 0.0406
13 -0.000763 -0.001702 -0.0175 1.139 0.000177 0.0411
14 -0.010506 -0.000234 -0.0764 1.125 0.003099 0.0406
15 -0.008622 0.042042 0.2384 1.018 0.041811 0.0420
16 -0.006407 0.009578 0.0239 1.148 0.000345 0.0488
17 -0.054463 0.085385 0.2303 1.044 0.038438 0.0474
18 -0.009648 0.012719 0.0245 1.157 0.000363 0.0561
19 -0.145174 0.184432 0.3238 1.009 0.078107 0.0608
20 0.114479 -0.149993 -0.2847 1.029 0.033283 0.0568
21 -0.282767 0.335956 0.4884 0.932 0.183075 0.0781
22 0.304509 -0.346099 -0.4398 1.066 0.077348 0.1084
23 0.089374 -0.100144 -0.1219 1.240 0.008049 0.1270
24 0.221731 -0.247052 -0.2958 1.205 0.041771 0.1367
25 0.579077 -0.646655 -0.7793 0.904 0.187567 0.1326
Conclusions:
- there is no longer clear evidence of a wedge (funnel) shape in the residuals suggesting that the assumption of homogeneity of variance is satisfied.
- the tails of the Q-Q plot are an improvement on the Poisson.
- there are no longer any large (>0.8) Cook’s distances.
$uniformity
Exact one-sample Kolmogorov-Smirnov test
data: simulationOutput$scaledResiduals
D = 0.11352, p-value = 0.8684
alternative hypothesis: two-sided
$dispersion
DHARMa nonparametric dispersion test via sd of residuals fitted vs.
simulated
data: simulationOutput
dispersion = 1.1852, p-value = 0.496
alternative hypothesis: two.sided
$outliers
DHARMa bootstrapped outlier test
data: simulationOutput
outliers at both margin(s) = 0, observations = 25, p-value = 1
alternative hypothesis: two.sided
percent confidence interval:
0.00 0.04
sample estimates:
outlier frequency (expected: 0.0116 )
0
Conclusions:
- there is no longer clear evidence of issues with the residual plot.
- the Q-Q plot is a much better match to the ideal line
- there is no longer any evidence of violations of distributional conformity (KS test), Dispersion and Outliers
The model diagnostics suggest that the Negative Binomial model is more appropriate - since it satisfied the assumptions whereas the Poisson did not. In this case, this should be the overriding factor in selecting between the two models. If, however, both were found to be valid, we could use information criteria to help us choose between the two.
Information criteria provide a relative measure of the fit of a model, penalising for complexity (i.e. a balance between goodness of fit and model simplicity). The actual value of an information criterion by itself is of no real use; it is only useful as a comparative measure between two or more related models. For this purpose, the lower value is considered better.
One of the most widely used information criteria is the Akaike Information Criterion (AIC). The AIC is a measure of in-sample prediction error that penalises complexity:
\[ AIC = 2k - 2ln(\hat{L}) \] where \(k\) is the number of parameters being estimated and \(\hat{L}\) is the maximum likelihood of the fitted model.
As a general rule, if two AICs are within 2 units of each other, they are considered to be not significantly different from one another. In that case, you would select the simpler model (the one with fewer estimated parameters/degrees of freedom).
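The AIC formula above can be verified by hand against R's `AIC()` on any fitted model (the built-in `mtcars` data is used here purely for illustration):

```r
fit <- lm(mpg ~ wt, data = mtcars)
k   <- attr(logLik(fit), "df")           # parameters estimated (incl. sigma)
aic <- 2 * k - 2 * as.numeric(logLik(fit))
all.equal(aic, AIC(fit))                 # TRUE
```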
df AIC
peake.glm 2 1738.9471
peake.glm1 3 312.1603
## For small sample sizes, it is better to use AICc - this is
## corrected for small sample sizes.
AICc(peake.glm, peake.glm1)
           df      AICc
peake.glm 2 1739.4926
peake.glm1 3 313.3032
Conclusions:
- purely on the basis of AIC, we would also have selected the Negative Binomial model (peake.glm1) over the Poisson model (peake.glm) as it has a substantially lower (>2) AIC.
Unfortunately, these seem to be broken at the moment…
## The following is equivalent to ggeffect
peake.glm1 |> plot_model(type = 'eff', show.data = TRUE, terms = 'AREA')
## plot_model(peake.glm1, type='eff', show.data=FALSE, terms='AREA [log]')
## plot_model(peake.glm1, type='eff', show.data=FALSE, terms='AREA [exp]')
## The following is equivalent to ggpredict
#plot_model(peake.glm1, type='pred', show_data=TRUE, terms='AREA [exp]')
## The following is equivalent to ggemmeans
## plot_model(peake.glm1, type='emm', terms='AREA [exp]')
The ggpredict() |> plot() combination produces a list of ggplot objects. If we want to make adjustments to the ggplot object, we need to specify a single term in ggpredict().
6 Model investigation / hypothesis testing
Call:
glm.nb(formula = INDIV ~ log(AREA), data = peake, init.theta = 7.369263003,
link = log)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.16452 0.53387 -2.181 0.0292 *
log(AREA) 0.82469 0.06295 13.102 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(7.3693) family taken to be 1)
Null deviance: 161.076 on 24 degrees of freedom
Residual deviance: 25.941 on 23 degrees of freedom
AIC: 312.16
Number of Fisher Scoring iterations: 1
Theta: 7.37
Std. Err.: 2.16
2 x log-likelihood: -306.16
Conclusions:
- the estimated y-intercept (value of Number of Individuals when log AREA is 0) is -1.16. Note that this is still on the link (log) scale. Note also that, as a log AREA value of 0 is outside the range of the collected data, this intercept does not really have any useful interpretation (unless we are happy to extrapolate). If we had centred the Area predictor (either before modelling or as part of the model - glm.nb(INDIV ~ scale(log(AREA), scale = FALSE), data = peake)), then the y-intercept would have had a sensible interpretation: the expected number of Individuals at the average clump Area. Nevertheless, either way, the hypothesis test associated with the intercept is rarely of any value.
- the estimated slope is 0.82. For every one unit increase in log Area, the number of individuals (on a log scale) is expected to increase by 0.82 units.
- if we back-transform the slope (by exponentiation), we get a slope of 2.28. This is interpreted as: for every 1 unit increase in log Area, the number of individuals increases 2.28-fold. That is, there is a 128 percent increase per 1 unit increase in log Area.
- the estimated slope divided by the estimated uncertainty in the slope (Std. Error) gives a z-value of 13.1. If we compare that to a standard normal distribution, we get a p-value < 0.05. Hence we would reject the null hypothesis of no relationship between Area and number of Individuals.
- the \(R^2\) value is 0.86. Hence, 86 percent of the variation in number of Individuals is explained by its relationship to Area.
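The back-transformations referred to above are simply exponentiations of the link-scale estimates from the summary table:

```r
## Back-transform the link (log) scale estimates from the summary table
exp(-1.16452)  # intercept: ~0.31 individuals when log(AREA) = 0
exp(0.82469)   # slope: ~2.28-fold increase per unit of log(AREA)
```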
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) -2.257255 -0.04087877
log(AREA) 0.693092 0.95477695
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 0.1046373 0.9599455
log(AREA) 1.9998896 2.5980910
Conclusions:
- the confidence intervals provide more context about the estimated effects of Area on the number of individuals. In addition to describing the implications associated with the typical effect (as indicated by the point estimate of the slope), we would also discuss the implications if the effect was as low as a 2-fold increase or as high as a 2.6-fold increase.
# A tibble: 2 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -1.16 0.534 -2.18 2.92e- 2 -2.26 -0.0409
2 log(AREA) 0.825 0.0629 13.1 3.22e-39 0.693 0.955
# A tibble: 2 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.312 0.534 -2.18 2.92e- 2 0.105 0.960
2 log(AREA) 2.28 0.0629 13.1 3.22e-39 2.00 2.60
# A tibble: 1 × 8
null.deviance df.null logLik AIC BIC deviance df.residual nobs
<dbl> <int> <logLik> <dbl> <dbl> <dbl> <int> <int>
1 161. 24 -153.0802 312. 316. 25.9 23 25
7 Predictions
\[ R^2 = 1 - \exp\left(-\frac{2}{n}\left(\log L(x) - \log L(0)\right)\right) \] where \(\log L(x)\) and \(\log L(0)\) are the log-likelihoods of the fitted and null models respectively. Since the maximum achievable \(R^2\) from this formula can be less than one, it is sometimes adjusted (Nagelkerke’s method) by dividing by: \[ R^2_{max} = 1 - \exp\left(\frac{2}{n}\log L(0)\right) \]
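These pseudo-\(R^2\) quantities can be computed by hand from the fitted and null log-likelihoods. A sketch on a hypothetical Poisson GLM (simulated data, not the peake data):

```r
set.seed(1)
## Hypothetical Poisson data
x <- runif(50, 1, 10)
y <- rpois(50, exp(0.2 + 0.3 * x))
fit  <- glm(y ~ x, family = poisson)
fit0 <- update(fit, . ~ 1)                  # null (intercept-only) model
n    <- length(y)
ll   <- as.numeric(logLik(fit))
ll0  <- as.numeric(logLik(fit0))
r2       <- 1 - exp(-2 / n * (ll - ll0))    # Cox-Snell R^2
r2.max   <- 1 - exp(2 / n * ll0)            # maximum achievable R^2
r2.nagel <- r2 / r2.max                     # Nagelkerke's adjustment
```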
8 Summary figures
## Using emmeans
peake.grid <- with(peake, list(AREA=seq(min(AREA), max(AREA), len=100)))
#OR
peake.grid <- peake |>
data_grid(AREA=seq_range(AREA, n=100))
newdata <- peake.glm1 |>
emmeans(~AREA, at = peake.grid, type = "response") |>
as.data.frame()
head(newdata)
      AREA response        SE  df asymp.LCL asymp.UCL
462.2500 49.19889 7.916538 Inf 35.89132 67.44057
731.7629 71.85811 9.773036 Inf 55.04380 93.80870
1001.2759 93.06469 11.171565 Inf 73.55395 117.75081
1270.7888 113.28088 12.317959 Inf 91.53738 140.18926
1540.3017 132.75321 13.318842 Inf 109.05506 161.60108
1809.8146 151.63405 14.238678 Inf 126.14428 182.27450
Confidence level used: 0.95
Intervals are back-transformed from the log scale
ggplot(newdata, aes(y=response, x=AREA)) +
geom_ribbon(aes(ymin=asymp.LCL, ymax=asymp.UCL),fill='blue', alpha=0.3) +
geom_line() +
  theme_classic()

ggplot(newdata, aes(y=response, x=AREA)) +
geom_ribbon(aes(ymin=asymp.LCL, ymax=asymp.UCL),fill='blue', alpha=0.3) +
geom_line() +
geom_point(data=peake, aes(y=INDIV)) +
scale_x_log10(breaks = as.vector(c(1,2,5,10) %o% 10^(-1:4))) +
scale_y_log10() +
  theme_classic()

## If we want to plot the partial observations
partial.obs <- peake.glm1 |> emmeans(~AREA, at=peake, type='response') |>
as.data.frame() |>
mutate(Partial.obs=response+resid(peake.glm1,type='response'))
## response residuals are just resid * mu.eta(predict)
ggplot(newdata, aes(y=response, x=AREA)) +
geom_ribbon(aes(ymin=asymp.LCL, ymax=asymp.UCL),fill='blue', alpha=0.3) +
geom_line() +
geom_point(data=peake, aes(y=INDIV)) +
geom_point(data=partial.obs, aes(y=Partial.obs), color='green') +
scale_x_log10(breaks = as.vector(c(1,2,5,10) %o% 10^(-1:4))) +
scale_y_log10() +
theme_classic()